Introduction

Since the first cases were reported in December 2019, Covid-19 has significantly impacted the world from an economic, political and social point of view. Countries have been affected in different ways, depending on many factors such as the age of the population, the temperature, the health and social conditions of the inhabitants and much more.

Luckily, the development of vaccines to prevent the spread of Covid-19 brought light into this darkness and the hope of soon getting back to normal life. The manufacturers include Pfizer, Moderna, Johnson & Johnson and AstraZeneca. Countries around the world have different vaccination plans, resulting in different paces of vaccination.

The aim of this project is to analyze and forecast the trend of weekly vaccinations across different countries, depending on specific parameters which play a major role in a country's vaccination progress.

Generative Process

The generative process is the sequential procedure that has to be followed to properly model the interaction of interest; in our case, the number of people fully vaccinated, which corresponds to $y_{t,k}$. The generative process below corresponds to the final model: an auto-regressive model of order 2.

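  1. For each cluster $c \in [1,\dots,C]$:

    a. Draw transition coefficients $b_c \sim \mathcal{N}(b_c \mid \mu_b, \sigma_b)$

    b. Draw parameter $W_c \sim \mathrm{Cauchy}(W_c \mid \mu_w, \sigma_w)$

  2. For each country $k \in [1,\dots,K]$:

    a. Draw first observation $y_{1,k} \sim \mathcal{N}(y_{1,k} \mid \mu_0, \sigma_0)$

    b. Draw second observation $y_{2,k} \sim \mathcal{N}(y_{2,k} \mid \beta_1 y_{t-1}, \sigma_1)$

  3. For each week $t \in [1,\dots,T]$:

    a. Calculate the variable $\mathrm{fullyvacc}_{t,k} = \frac{1}{1 + e^{-(\mathrm{yfully}_{t,k})}}$

  4. For each country $k \in [1,\dots,K]$:

    1. For each cluster $c \in [1,\dots,C]$:

      1. For each week $t \in [3,\dots,T+T_{\mathrm{forecast}}]$:

        a. Draw target variable $y_{t,k} \sim \mathcal{N}(y_{t,k} \mid (b_{1,c}\, y_{t-2} + b_{2,c}\, y_{t-1}) \cdot (1 + \mathrm{fullyvacc}_{t,k}), W_c)$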
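To make the steps above more concrete, here is a minimal NumPy sketch that draws one synthetic data set from this generative process. It is only an illustration, not the model code used later in the notebook. Several things are assumptions: all hyperparameter values, the function name `simulate_ar2`, the `cluster_of` array that assigns each country to exactly one cluster, the synthetic covariate standing in for $\mathrm{yfully}_{t,k}$ (generated for the forecast weeks as well), and the absolute value taken on the Cauchy draw so that $W_c$ can be used as a standard deviation.

```python
import numpy as np


def simulate_ar2(K=6, C=2, T=10, T_forecast=4, seed=0):
    """Draw one synthetic data set from the AR(2) generative process (sketch)."""
    rng = np.random.default_rng(seed)
    total = T + T_forecast

    # Hypothetical hyperparameters; the text does not state their values.
    mu_b, sigma_b = 0.3, 0.1
    mu_w, sigma_w = 0.0, 0.1
    mu_0, sigma_0 = 0.0, 1.0
    beta_1, sigma_1 = 0.9, 0.5

    # Step 1: per-cluster coefficients (b_{1,c}, b_{2,c}) and scale W_c.
    b = rng.normal(mu_b, sigma_b, size=(C, 2))
    # Assumption: the absolute value keeps the Cauchy draw non-negative.
    W = np.abs(mu_w + sigma_w * rng.standard_cauchy(size=C))

    # Assumption: every country belongs to exactly one cluster.
    cluster_of = rng.integers(0, C, size=K)

    # Step 3: logistic transform of a synthetic stand-in for yfully_{t,k}.
    yfully = rng.normal(0.0, 1.0, size=(total, K))
    fullyvacc = 1.0 / (1.0 + np.exp(-yfully))

    # Step 2: first two observations of every country.
    y = np.empty((total, K))
    y[0] = rng.normal(mu_0, sigma_0, size=K)
    y[1] = rng.normal(beta_1 * y[0], sigma_1)

    # Step 4: AR(2) recursion, using the parameters of each country's cluster.
    for t in range(2, total):
        mean = (b[cluster_of, 0] * y[t - 2] + b[cluster_of, 1] * y[t - 1]) * (
            1.0 + fullyvacc[t]
        )
        y[t] = rng.normal(mean, W[cluster_of])

    return y, fullyvacc, cluster_of


y, fullyvacc, cluster_of = simulate_ar2()
print(y.shape)  # one row per week, one column per country
```

Rows of `y` are weeks (0-based, so row 0 is week 1) and columns are countries. Fitting the model reverses this direction: the observed weeks are given and the latent $b_c$ and $W_c$ are inferred.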

PGM

A Probabilistic Graphical Model (PGM) is a visual representation of a real-life interaction, modelled taking into account the intrinsic uncertainty of the interaction itself.

The picture below shows the PGM of the final AR(2) model used to predict the evolution of vaccination across countries (see section 2.5). As usual, shaded nodes represent observed values, while white nodes represent latent ones.

It can be seen that, while most parameters depend on the country k and time step t, the coefficient b and standard deviation W of our target variable $y_{t,k}$ are assigned per cluster c. This is not an assumption of the initial basic model; it is the result of the different attempts and approaches explored throughout this notebook.

[Figure: PGM of the final AR(2) model]

Installations and Imports

First of all, we apply the usual imports.

Functions

Next, we define some functions which will be used throughout this notebook.

Loading Data

The next step is to load the data which will be analyzed in this project. In particular, two datasets will be used:

  1. CovidData, which contains information on the Covid-19 vaccination process across different countries, such as the daily vaccinations, the number of people vaccinated, the manufacturers and more.
  2. ExtraData2, which contains country-specific information related to economic and social aspects, such as the GDP and the percentage of investment in health.

Data pre-processing

The aim of this section is to ensure that there is a connection between the CovidData and the ExtraData2 datasets. These two datasets will be linked together by the attribute "country". However, prior to this, some cleaning is needed to make both datasets match in terms of country name.

Country names correction

The functions used are defined in the Functions section in case the reader wants to go over them. In any case, the process of data cleaning will be explained step by step.

Before correcting the naming of the countries between the two datasets, the dataframes are filled:

To start the mapping, lists containing the countries are created with the function make_countries_list(). As can be seen, several countries will not be used for the models, since there are many more countries in the ExtraData2 dataset than in the CovidData one.

With the lists of countries in place, the function check_countries() shows that not all countries in CovidData are found in ExtraData2. This may be due to commas, spaces or abbreviations in the country names in one or both of the datasets. A good practice is to display all the names:

Due to the nature of the dataset, the approach followed is to manually correct the country names.

As we can see, the list of countries that do not match is now an empty array, meaning that every country in CovidData has a matching name in ExtraData2. Therefore, we can proceed with the clustering of the countries.
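The check-and-correct loop described in this section can be sketched in a few lines of plain Python. This is only an illustration of the idea: the helpers `unmatched_countries` and `apply_corrections`, the toy country lists and the correction dictionary are all hypothetical, and they are not the notebook's own make_countries_list() and check_countries() functions.

```python
def unmatched_countries(covid_countries, extra_countries):
    """Return the sorted names found in the first list but not in the second."""
    return sorted(set(covid_countries) - set(extra_countries))


def apply_corrections(countries, corrections):
    """Rename countries using a manual mapping; other names are left as they are."""
    return [corrections.get(name, name) for name in countries]


# Toy data (hypothetical): the same countries spelled differently in each source.
covid = ["Italy", "Denmark", "United States", "Czechia"]
extra = ["Italy", "Denmark", "United States of America", "Czech Republic", "Brazil"]

print(unmatched_countries(covid, extra))  # ['Czechia', 'United States']

# Manual corrections, written after inspecting the printed names.
corrections = {
    "United States of America": "United States",
    "Czech Republic": "Czechia",
}
extra_fixed = apply_corrections(extra, corrections)

print(unmatched_countries(covid, extra_fixed))  # []
```

An empty result only guarantees that every CovidData country has a counterpart. Extra countries in the larger dataset, such as "Brazil" here, remain and are simply unused.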

PCA Clustering

To cluster the countries into 5 groups, a Principal Component Analysis (PCA) is performed first. In this way, we can tell which features are the most important and maintain explainability throughout the process.

Before starting the process, we will standardize the data:
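As a reminder of what standardization does, here is a minimal NumPy sketch: each feature (column) is shifted to zero mean and scaled to unit standard deviation, so that features measured on large scales do not dominate the PCA. The function name `standardize` and the toy values are assumptions for illustration, and the notebook may use a library scaler instead.

```python
import numpy as np


def standardize(X):
    """Scale every column to zero mean and unit (population) standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Constant columns have zero spread; dividing by 1 leaves them at zero.
    std[std == 0] = 1.0
    return (X - mean) / std


# Toy example (hypothetical values): 3 countries, 2 features on different scales.
X = [[1.0, 10_000.0], [2.0, 20_000.0], [3.0, 30_000.0]]
Z = standardize(X)
print(Z.mean(axis=0), Z.std(axis=0))
```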

To get a clear idea of how the dataset's features are correlated, it is a good idea to plot the correlation matrix:

With this in mind, we perform the PCA and look into the importance of every feature:

To reduce complexity while keeping the explainability of the dataset, the threshold is set at 70%. This threshold is already reached with the first 3 principal components.
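The selection rule can be sketched as follows, assuming the 70% threshold refers to the cumulative share of variance explained by the leading principal components. The sketch computes the PCA through a singular value decomposition in NumPy rather than through a library PCA class. The function names and the example ratios are hypothetical, not the values obtained in this notebook.

```python
import numpy as np


def pca_explained_variance(X_std):
    """Return the principal axes (as rows) and the explained-variance ratios."""
    X_centred = X_std - X_std.mean(axis=0)
    # Singular values are returned in descending order, so the rows of Vt
    # are already sorted from most to least important component.
    _, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    return Vt, explained


def n_components_for(explained, threshold=0.70):
    """Smallest number of leading components whose cumulative ratio reaches the threshold."""
    cumulative = np.cumsum(explained)
    n = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n, len(explained))


# Hypothetical ratios: the cumulative sums are 0.40, 0.60, 0.75, ...
ratios = np.array([0.40, 0.20, 0.15, 0.15, 0.10])
print(n_components_for(ratios, threshold=0.70))  # 3
```

Each row of the returned axes holds the weights of the original attributes in one component, which is the quantity to inspect when looking at the contribution of every attribute.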

Let's look at the contribution of every attribute:

Having looked into which attributes are key for the interpretability of the dataset, the clustering can now be performed. The data is put into the appropriate format below:

Now the clustering is performed. The K-means algorithm has been chosen, since it is practical and lets the number of clusters be set directly, in our case, 5.
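For readers who want to see what K-means does internally, here is a minimal NumPy sketch of the standard iteration (Lloyd's algorithm): assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. This hand-written version is for illustration only and is not the implementation used in this notebook; the function name `kmeans`, its parameters and the toy data are assumptions. In practice a library implementation such as scikit-learn's `KMeans` is the more usual choice.

```python
import numpy as np


def kmeans(points, n_clusters=5, n_iter=100, seed=0):
    """Cluster points with Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)

    def nearest(centroids):
        # Distance from every point to every centroid, then index of the closest.
        diffs = points[:, None, :] - centroids[None, :, :]
        return np.linalg.norm(diffs, axis=2).argmin(axis=1)

    # Initialise the centroids with distinct, randomly chosen data points.
    start = rng.choice(len(points), size=n_clusters, replace=False)
    centroids = points[start]

    for _ in range(n_iter):
        labels = nearest(centroids)
        new_centroids = centroids.copy()
        for c in range(n_clusters):
            members = points[labels == c]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[c] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Final assignment, consistent with the returned centroids.
    return nearest(centroids), centroids


# Toy data (hypothetical): 5 well-separated groups of 10 points in 2 dimensions.
rng = np.random.default_rng(1)
centres = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 20]], dtype=float)
pts = np.vstack([c + rng.normal(scale=0.3, size=(10, 2)) for c in centres])
labels, centroids = kmeans(pts, n_clusters=5)
print(np.bincount(labels, minlength=5))  # number of points per cluster
```

In the notebook's setting, the rows of `points` would be the countries expressed in the retained principal components. The result depends on the random initialisation, so a fixed seed keeps the cluster assignment reproducible.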

The main reason for using PCA is to increase the interpretability and robustness of the clustering.

Regarding interpretability, this may seem counterintuitive, since moving away from the real attributes makes the individual dimensions harder to interpret. However, PCA allows the data points (the countries) to be projected onto a 2-dimensional space, so that one can visually understand the clusters.

Finally, the countries in each cluster can be seen below, as well as a world map displaying the countries* colored by cluster.

*Unfortunately, the tool used does not recognise all countries as such and, for instance, countries such as Laos, Monaco or Andorra are not displayed.